
chore(buffers): Rework counter/gauge increment handling#23542

Closed
bruceg wants to merge 16 commits into master from bruceg/rework-buffer-counters

Conversation

@bruceg
Member

@bruceg bruceg commented Aug 6, 2025

Summary

This is a step-by-step refactoring of the changes made in #23453 and #23507 in order to remove the need for the shared BUFFER_COUNTERS dashmap. While the whole change is rather large, I attempted to keep each commit from altering externally-visible behavior, with the exception of additional labels in the buffer create gauges. As such, reviewing each individual commit is likely more useful than reviewing the whole code change.

However, this leaves a bit of a conundrum. If we look at the changes to lib/vector-buffers/src starting immediately before the above changes (you have to scroll down manually to the buffer source files; the direct GitHub link doesn't quite work), the result doesn't entirely make sense:

  1. The metrics have the buffer_id added to disambiguate metrics with the same ID but different stages. I thought, however, that the buffers would have been created in the context of a component and, as such, would already have the component ID in their tags. If not, this is a real bug, maybe even the real bug. Since the buffer_id corresponds to a component ID, maybe that particular tag name is less than ideal.
  2. The counters are updated using a saturating add, which prevents them from overflowing, which is good, but does nothing to deal with the negatives that were the heart of Vector buffer metrics report negative values #21702, the issue that prompted the above two PRs.
  3. The gauges have their values set instead of incremented from the same values. This is the part that really doesn't make sense to me, but it is a byproduct of all the refactoring.
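The interplay of points 2 and 3 can be sketched in isolation. This is a minimal sketch in plain Rust with illustrative function names, not Vector's actual API: saturating addition guards the unsigned counters against overflow, while clamping the signed difference is what keeps the gauges from going negative.

```rust
// Hypothetical sketch of the two update styles discussed above; the
// function names are illustrative, not Vector's actual code.

fn counter_update(current: u64, delta: u64) -> u64 {
    // Saturating add prevents overflow wrap-around; since both operands
    // are unsigned, the result can never be negative in the first place.
    current.saturating_add(delta)
}

fn gauge_set(received: u64, sent: u64) -> f64 {
    // Clamping the signed difference at zero is what guards against
    // negative gauge values, e.g. when sends are observed before receives.
    (received as i64 - sent as i64).clamp(0, i64::MAX) as f64
}

fn main() {
    assert_eq!(counter_update(u64::MAX, 10), u64::MAX); // saturates at the max
    assert_eq!(gauge_set(2000, 100), 1900.0);
    assert_eq!(gauge_set(100, 2000), 0.0); // clamped, not negative
}
```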

Vector configuration

N/A

How did you test this PR?

Tested using the unit tests; all of the above should be behavior-preserving.

Change Type

  • Bug fix
  • New feature
  • Non-functional (chore, refactoring, docs)
  • Performance

Is this a breaking change?

  • Yes
  • No

Does this PR include user facing changes?

  • Yes. Please add a changelog fragment based on our guidelines.
  • No. A maintainer will apply the no-changelog label to this PR.

References

Notes

  • Please read our Vector contributor resources.
  • Do not hesitate to use @vectordotdev/vector to reach out to us regarding this PR.
  • Some CI checks run only after we manually approve them.
    • We recommend adding a pre-push hook; please see this template.
    • Alternatively, we recommend running the following locally before pushing to the remote branch:
      • cargo fmt --all
      • cargo clippy --workspace --all-targets -- -D warnings
      • cargo nextest run --workspace (alternatively, you can run cargo test --all)
  • After a review is requested, please avoid force pushes to help us review incrementally.
    • Feel free to push as many commits as you want. They will be squashed into one before merging.
    • For example, you can run git merge origin master and git push.
  • If this PR introduces changes to Vector dependencies (modifies Cargo.lock), please
    run cargo vdev build licenses to regenerate the license inventory and commit the changes (if any). More details here.

bruceg added 15 commits August 6, 2025 12:14
The counters that come in from the internal events can never contain negative
values, so handling them is completely unnecessary.
Note that the tests on the state of the gauges have been dropped as they now
would only test the proper operation of the metrics recorder.
This ends up dropping all the new tests, as they only exercised the
`BUFFER_COUNTERS` map, which could be removed with this change.
The values in the `total_*` counters are manipulated identically to the other
fields, so we can drop them now.
@bruceg bruceg requested a review from pront August 6, 2025 20:03
@bruceg bruceg requested a review from a team as a code owner August 6, 2025 20:03
@bruceg bruceg added type: tech debt A code change that does not add user value. domain: buffers Anything related to Vector's memory/disk buffers complexity: moderate Moderate complex issue, that takes a fair bit of work to solve. labels Aug 6, 2025
pront
pront previously approved these changes Aug 6, 2025
Member

@pront pront left a comment


The gauges have their values set instead of incremented from the same values. This is the part that really doesn't make sense to me, but is an outflow of all the refactoring.

This is concerning, but it is also an existing issue.

@bruceg bruceg added the no-changelog Changes in this PR do not need user-facing explanations in the release changelog label Aug 6, 2025
@vparfonov
Contributor

2. The counters are updated using a saturating add which prevents them from overflowing, which is good, but does nothing to deal with negatives which were the heart of Vector buffer metrics report negative values #21702 that prompted the above two PRs.

@bruceg
Take a look please here: https://github.com/vectordotdev/vector/pull/23453/files#diff-cd5d7e9d7dc7fc3347ca77c5322795f9115cfc8d0a71375d91de45d9d810ff51R20
value will be in range: [0, i64::MAX]

@vparfonov
Contributor

The gauges have their values set instead of incremented from the same values. This is the part that really doesn't make sense to me, but is an outflow of all the refactoring.

This is concerning but also it an existing issue.

The main reason to use set() instead of increment() is to prevent the value from becoming negative. The set() method allows us to validate the number first, ensuring it's always a correct and safe value.

@bruceg
Member Author

bruceg commented Aug 6, 2025

  1. The counters are updated using a saturating add which prevents them from overflowing, which is good, but does nothing to deal with negatives which were the heart of Vector buffer metrics report negative values #21702 that prompted the above two PRs.

@bruceg Take a look please here: https://github.com/vectordotdev/vector/pull/23453/files#diff-cd5d7e9d7dc7fc3347ca77c5322795f9115cfc8d0a71375d91de45d9d810ff51R20 value will be in range: [0, i64::MAX]

Are you saying that we are seeing values that are wrapping around an i64? That is, byte size totals larger than 9 exabytes (9e18)? I'm not clear on the point you are making here.

@bruceg
Member Author

bruceg commented Aug 6, 2025

The main reason to use set() instead of increment() is to prevent the value from becoming negative. The set() method allows us to validate the number first, ensuring it's always a correct and safe value.

If I traced the logic right, the upstream values for those increments are always positive as they are communicated in unsigned integers. Did I mix up a subtraction?

@vparfonov
Contributor

  1. The counters are updated using a saturating add which prevents them from overflowing, which is good, but does nothing to deal with negatives which were the heart of Vector buffer metrics report negative values #21702 that prompted the above two PRs.

@bruceg Take a look please here: https://github.com/vectordotdev/vector/pull/23453/files#diff-cd5d7e9d7dc7fc3347ca77c5322795f9115cfc8d0a71375d91de45d9d810ff51R20 value will be in range: [0, i64::MAX]

Are you saying that we are seeing values that are wrapping around an i64? That is, byte size totals larger than 9 exabytes (9e18)? I'm not clear on the point you are making here.

You mentioned this doesn't deal with negatives, but I pointed to clamp(0, i64::MAX). This part of the code gives us the assurance that the result will not be negative.

@vparfonov
Contributor

The main reason to use set() instead of increment() is to prevent the value from becoming negative. The set() method allows us to validate the number first, ensuring it's always a correct and safe value.

If I traced the logic right, the upstream values for those increments are always positive as they are communicated in unsigned integers. Did I mix up a subtraction?

Maybe my previous note was a bit unclear. The change from increment() to set() isn't about a single operation going negative. It's about making the metric more robust.

The old increment-based approach is fragile because it assumes we'll never miss an event and that events arrive in the correct order. With set(), we're reporting the actual, current state. This makes the metric self-healing and immune to missed events or race conditions.
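The set()-vs-increment() trade-off described above can be sketched with a toy model (plain Rust, illustrative names only; not Vector's code): a single lost event corrupts an increment-based gauge forever, while a set-based gauge recovers on the next report.

```rust
// Illustrative sketch: increment-style gauges drift after a lost event,
// set-style gauges are defined by the most recent report alone.

/// Apply per-report deltas to a gauge, optionally "losing" one event.
fn incremental_gauge(sizes: &[u64], missed: Option<usize>) -> i64 {
    let mut gauge = 0i64;
    let mut prev = 0i64;
    for (i, &size) in sizes.iter().enumerate() {
        let delta = size as i64 - prev;
        prev = size as i64;
        if missed != Some(i) {
            gauge += delta; // this event's delta reaches the gauge
        }
    }
    gauge
}

/// set()-style reporting: the last report alone defines the gauge.
fn set_gauge(sizes: &[u64]) -> i64 {
    *sizes.last().unwrap() as i64
}

fn main() {
    let true_sizes = [100u64, 250, 180, 300];
    // With no lost events the two styles agree.
    assert_eq!(incremental_gauge(&true_sizes, None), 300);
    // Losing one delta corrupts the incremental gauge permanently...
    assert_eq!(incremental_gauge(&true_sizes, Some(1)), 150);
    // ...while set()-style reporting still shows the true state.
    assert_eq!(set_gauge(&true_sizes), 300);
}
```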

@vparfonov
Contributor

I think I found a small but critical issue. The gauge is not tracking the buffer's total size; it's only tracking the size of the most recent batch of events.
I added a new unit test that demonstrates this behavior.

For reference, here is the approach from my PR: https://github.com/vectordotdev/vector/pull/23518/files#diff-69337e368c06496cca89f39bdf229fb2ebe3c0413c40f3fc0cd9d19d4fdea177R53

/cc @bruceg @pront

    #[test]
    fn test_event_and_byte_size_gauges_are_overwritten() {
        use metrics::{Key, Label};
        use metrics_util::debugging::{DebugValue, DebuggingRecorder};
        use metrics_util::{CompositeKey, MetricKind};
        use ordered_float::OrderedFloat;

        let recorder = DebuggingRecorder::default();
        let snapshotter = recorder.snapshotter();
        let buffer_id = "test_buffer_overwrite";
        let stage_idx = 0;

        metrics::with_local_recorder(&recorder, || {
            emit(BufferEventsReceived {
                buffer_id: buffer_id.to_string(),
                idx: stage_idx,
                count: 100,
                byte_size: 2000,
            });
            emit(BufferEventsSent {
                buffer_id: buffer_id.to_string(),
                idx: stage_idx,
                count: 5,
                byte_size: 100,
            });
        });

        let expected_events_count = 100.0 - 5.0; //95.0
        let expected_byte_size = 2000.0 - 100.0; //1900.0

        let metrics = snapshotter.snapshot().into_vec();

        let events_key = CompositeKey::new(
            MetricKind::Gauge,
            Key::from_parts(
                "buffer_events",
                vec![
                    Label::new("buffer_id", buffer_id.to_string()),
                    Label::new("stage", stage_idx.to_string()),
                ],
            ),
        );
        let final_events_gauge = metrics
            .iter()
            .find(|(key, _, _, _)| key == &events_key)
            .map(|(_, _, _, value)| value)
            .unwrap();

        assert_eq!(
            *final_events_gauge,
            DebugValue::Gauge(OrderedFloat(expected_events_count))
        );

        let bytes_key = CompositeKey::new(
            MetricKind::Gauge,
            Key::from_parts(
                "buffer_byte_size",
                vec![
                    Label::new("buffer_id", buffer_id.to_string()),
                    Label::new("stage", stage_idx.to_string()),
                ],
            ),
        );
        let final_bytes_gauge = metrics
            .iter()
            .find(|(key, _, _, _)| key == &bytes_key)
            .map(|(_, _, _, value)| value)
            .unwrap();

        assert_eq!(
            *final_bytes_gauge,
            DebugValue::Gauge(OrderedFloat(expected_byte_size))
        );
    }

@pront
Member

pront commented Aug 7, 2025

I do appreciate the discussion here. Let's allow enough time to get this right. While we are busy preparing the next release, I have one time-sensitive question: do any of the findings here indicate that we have a gap on origin/master vs. the previous Vector release? I tried going over everything and I think the answer is "no, we can go ahead with the current state of origin/master," but I wanted to double-check with you folks. Thank you.

@vparfonov
Contributor

I do appreciate the discussion here. Let's allow enough time to get this right. While we are busy preparing the next release, I have one time-sensitive question: do any of the findings here indicate that we have a gap on origin/master vs. the previous Vector release? I tried going over everything and I think the answer is "no, we can go ahead with the current state of origin/master," but I wanted to double-check with you folks. Thank you.

I believe we can go ahead, but @bruceg's voice will be more valuable.

@bruceg
Member Author

bruceg commented Aug 7, 2025

You mentioned this doesn't deal with negatives, but I pointed to clamp(0, i64::MAX). This part of the code gives us the assurance that the result will not be negative.

Granted, yes. However, the values being added are always converted from unsigned and so start as positive, so the saturating_add should already result in a positive value.

Maybe my previous note was a bit unclear. The change from increment() to set() isn't about a single operation going negative. It's about making the metric more robust.

The old increment-based approach is fragile because it assumes we'll never miss an event and that events arrive in the correct order. With set(), we're reporting the actual, current state. This makes the metric self-healing and immune to missed events or race conditions.

Agree 100% here. My concern is that it appears, when all the refactoring is done, that they are set from the same value that had previously been used to increment them. That part doesn't make sense to me.

I do appreciate the discussion here. Let's allow enough time to get this right. While we are busy preparing the next release, I have one time sensitive question. Are any of the findings here indicate that we have a gap on origin/master vs the previous Vector release?

I think, given the above, a regression has been merged, albeit one that is not tested. It would be great to set up some test cases that assert the correct behavior, though, to be able to say for sure. The test above is a good start, but unfortunately it does not validate that the buffer mechanism itself adjusts the gauges to the correct values.

@pront
Member

pront commented Aug 7, 2025

FYI: To mitigate risk, we’ll revert the merged changes in #23556.
This will allow us to address the buffer metrics issues properly without introducing regressions.

@vparfonov
Contributor

vparfonov commented Aug 7, 2025

I think, given the above, a regression has been merged, albeit one that is not tested. It would be great to set up some test cases that assert the correct behavior, though, to be able to say for sure. The test above is a good start, but unfortunately it does not validate that the buffer mechanism itself adjusts the gauges to the correct values.

What regression do you mean? The solution that was merged into master was a necessary fix for a critical bug where the metrics were incorrect. It is fully thread-safe, and its logic is confirmed by the unit tests that were added in that same PR.

@pront pront self-requested a review August 7, 2025 20:37
@pront pront dismissed their stale review August 7, 2025 20:37

need to take a second look

@bruceg
Member Author

bruceg commented Aug 7, 2025

What regression do you mean? The solution that was merged into master was a necessary fix for a critical bug where the metrics were incorrect. It is fully thread-safe, and its logic is confirmed by the unit tests that were added in that same PR.

Do you have a configuration or other test case that can reproduce the original problem of negative metrics reported? I understand that may be difficult if it is a race that produces them, but within the buffers crate we can control the two halves to make that less of an issue.
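One way such a reproduction could be modeled (a hypothetical sketch, not Vector's actual buffer code): if the "sent" decrement is observed before the matching "received" increment, any snapshot taken between the two reports a negative gauge, even though the final value comes out correct.

```rust
// Hypothetical model of the race behind #21702: an out-of-order "sent"
// update makes intermediate gauge snapshots go negative. Names are
// illustrative, not Vector's code.

/// Running snapshots of a gauge after each signed update.
fn snapshots(updates: &[i64]) -> Vec<i64> {
    let mut gauge = 0i64;
    updates
        .iter()
        .map(|&delta| {
            gauge += delta;
            gauge // record the value an observer would see at this point
        })
        .collect()
}

fn main() {
    // Intended order: +100 received, then -100 sent.
    assert_eq!(snapshots(&[100, -100]), vec![100, 0]);
    // Raced order: the decrement lands first, so the intermediate
    // snapshot is negative even though the final value is correct.
    assert_eq!(snapshots(&[-100, 100]), vec![-100, 0]);
}
```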

@bruceg bruceg marked this pull request as draft August 8, 2025 00:33
@bruceg
Member Author

bruceg commented Aug 8, 2025

After re-examining all three patch sets, I noticed that I made a mistake in this step by dropping several negations. That skewed all the following refactoring steps and produced a broken result, which explains my confusion at the start. I will have to rework this to resolve that.

I still believe there is a much simpler solution hiding in the resulting code, but I can see that the current code is more correct in two aspects than the original gauges:

  1. clamping off negatives and over-large values
  2. adding the buffer ID to the metrics to disambiguate
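Point 2 can be illustrated with a toy model (a plain HashMap standing in for the metrics registry; names are illustrative, not Vector's code): keying gauges by (buffer_id, stage) keeps multi-stage buffers from overwriting each other's values.

```rust
use std::collections::HashMap;

// Minimal sketch of label disambiguation: without the stage in the key,
// two stages of the same buffer would collide on one gauge entry.

type GaugeKey = (String, usize);

fn record(gauges: &mut HashMap<GaugeKey, f64>, id: &str, stage: usize, v: f64) {
    // set()-style: each report overwrites the gauge for that exact key.
    gauges.insert((id.to_string(), stage), v);
}

fn main() {
    let mut gauges = HashMap::new();
    record(&mut gauges, "my_sink", 0, 2000.0);
    record(&mut gauges, "my_sink", 1, 512.0); // same buffer, later stage
    assert_eq!(gauges.len(), 2); // no collision between stages
    assert_eq!(gauges[&("my_sink".to_string(), 0)], 2000.0);
}
```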

@bruceg
Member Author

bruceg commented Aug 8, 2025

I will open a new PR when I have finished fixing the refactoring. I have now discovered two separate issues, the second of which is harder to compensate for.

